Abstract:- Utterances produced by a human while interacting with a software agent such as a Spoken
Dialogue System (SDS) carry a lot of valuable information about the internal mental state of the user.
Opinion mining is the analysis of a person's mental state: his opinion, appraisal or emotion towards an
event, an entity or its attributes. The user's level of certainty about a topic can be determined by
analysing not only the text of the utterance but also its prosodic information structure. Prosody
reveals information about the context by highlighting information structure and aspects of the
speaker-hearer relationship. It is often observed that the speaker's internal state is conveyed not by
the words he uses but by the tone of his utterance or by his facial expression. In this paper we analyzed
a sample of students after a lecture on the subject of operating systems, using the prosodic features of
each student's responses to a few questions posed in a dialogue. We determined whether the students were
certain, uncertain or neutral about their understanding of the lecture contents. We use PRAAT, a software
tool for speech analysis, to extract 15 acoustic features, and classify the certainty of the user's
responses with RAPIDMINER based on this prosodic information, which can aid the dialogue management
component of the SDS in framing a better dialogue strategy.

Keywords:- Uncertainty handling, Prosody information, Spoken language understanding, Machine 
learning. 

I. INTRODUCTION 

Spoken language is an intuitive form of interaction between humans and computers. Spoken language
understanding has been a challenge in the design of spoken dialogue systems, where the intention of the
speaker has to be identified from the words used in his utterances. Typically a spoken dialogue system
comprises four main components: an automatic speech recognition system (ASR), a spoken language
understanding component (SLU), a dialogue manager (DM) and a speech synthesis system which converts
text to speech (TTS). The accuracy of speech recognition systems remains questionable, and researchers
have proposed various solutions to the problem of automatic speech recognition, which has lagged behind
human performance [2], [3]. There have been some notable recent advances in discriminative training [4]
(e.g., maximum mutual information (MMI) estimation [5], minimum classification error (MCE) training [6], [7],
and minimum phone error (MPE) training [8], [9]), in large-margin techniques (such as large margin estimation
[10], [11], large margin hidden Markov model (HMM) [12], large-margin MCE [13]-[15], and boosted MMI
[16]), in novel acoustic models (such as conditional random fields (CRFs) [17]-[19], hidden CRFs
[20], [21] and segmental CRFs [22]), and in training densely connected, directed belief nets with many
hidden layers which learn a hierarchy of nonlinear feature detectors that can capture complex statistical
patterns in data [23].

Users often find that a computer does not understand their intended meaning even after correctly
recognizing the spoken utterance. One reason may be that in a face-to-face human conversation there are
contextual, audio and visual cues [1] which support efficient communication: besides hearing the
utterance, the listener is able to sense the mood and tone of the speaker and so comes to know whether
the speaker is certain or not. This is absent in a dialogue between a computer and a human because in
many potential applications there is only audio input and no video input. If spoken dialogue systems are
improved to use the prosodic information in the spoken utterance, systems such as spoken tutorial
dialogue systems [25], language learning systems [26] and voice search applications [27] can benefit
from knowing the user's level of certainty [24].


Opinion Mining Using The Prosodic Information In The Spoken... 



Our primary goal is to make use of prosodic information to aid the dialogue manager in selecting
the dialogue strategy for effective interaction and influencing the final outcome. Technically, prosody
is defined as the rhythm, stress, and intonation of speech, which reflect various features such as the
emotional state of the speaker, the form of the utterance (statement, question, or command), the presence
of irony or sarcasm, emphasis, contrast, and focus, or other elements of language that may not be encoded
by grammar or choice of vocabulary. The prosodic information of an utterance can be used to determine how
certain a speaker is, and hence his internal state of mind [28], which can be used for tasks ranging from
detecting frustration [29] to detecting flirtation [30] and other intentions. The proposed model, which
uses prosodic information to classify utterances, effectively colored the system responses in a student
evaluation information system and performed better than a trivial non-prosodic baseline model.

In the context of human-computer interaction, the study of prosodic information has been aimed at
extracting mood features so that an automatic spoken dialogue system can dynamically adapt its dialogue
strategy.

II. CORPUS AND CERTAINTY ANNOTATION

It is important to understand that not only which words a speaker uses in his utterance but also how
the words are spoken, along with the certainty factor, can guide the dialogue process between the
machine and the user. A spoken utterance may be perceived as uncertain, certain, neutral or mixed, which
helps the dialogue system to estimate the mental state of the user with respect to the utterance or the
concept he is speaking about. In this paper we examine the impact of a lecture on the subject of
operating systems on the understanding of the students, as expressed within the context of a spoken dialogue.

AGENT : What is an operating system. 

STUDENT: It is a set of software or hardware may be (UNCERTAIN) 

AGENT : Is it hardware or software. 

STUDENT: Software(CERTAIN) 

AGENT : What is the main function of Operating System. 

STUDENT: To provide a interface between user and machine(CERTAIN) 

AGENT : What do you know about round robin scheduling. 

STUDENT: Uh-uhh (NEUTRAL) 

Fig 1. An annotated excerpt from the student corpus. 

A corpus of 15 lecture-related dialogs was selected and, after listening, each sentence of the student
was labeled by an annotator as certain, uncertain or neutral. The dialogs were also annotated lexically,
based only on the words used, as certain, uncertain or neutral. The percentages of the corpus labeled
certain, uncertain and neutral under the auditory condition (annotated by listening to the audio in the
dialog context) and the lexical condition (annotated from the lexical structure of the dialogues) are
shown in Table I.



Table I : Percentage of Corpus with different levels of certainty.

CONDITION      LEVEL
               Certain      Uncertain      Neutral
Auditory       22.3%        18.4%          59.3%
Lexical        12.1%        11.7%          76.2%



It was observed that 40.7% of the corpus could be judged non-neutral (certain or uncertain) based on
the audio and the dialog context, compared to 23.8% based on the lexical information alone. We therefore
used the acoustic-prosodic features to obtain further information about certainty or uncertainty.
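
The non-neutral shares quoted above follow directly from Table I by adding the certain and uncertain percentages for each condition. A minimal check of that arithmetic is sketched below; the dictionary layout and variable names are our own illustration, not part of the original study.

```python
# Percentages copied from Table I; the data layout is illustrative only.
table_1 = {
    "auditory": {"certain": 22.3, "uncertain": 18.4, "neutral": 59.3},
    "lexical": {"certain": 12.1, "uncertain": 11.7, "neutral": 76.2},
}

# Non-neutral share = certain + uncertain, rounded to one decimal place
# to hide floating-point noise.
non_neutral = {
    condition: round(levels["certain"] + levels["uncertain"], 1)
    for condition, levels in table_1.items()
}

for condition, share in non_neutral.items():
    print(f"{condition}: {share}% non-neutral")
```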



III. METHODOLOGY 

For the basic model we compute values for the 15 prosodic features given in Table II for each
utterance in the corpus of the student lecture data set, using PRAAT (a program for speech analysis and
synthesis) [34] and Wavesurfer for extracting the f0 contour. Feature values are represented as z-scores
normalized by speaker. Temporal features such as voice breaks, unvoiced frames, degree of voice breaks
and total duration are not normalized. The set of features was selected in order to be comparable with
Liscombe et al. [31], who used the same features along with turn-related features for classifying
uncertainty.
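
Normalizing by speaker means that each feature value is expressed relative to the mean and standard deviation of that speaker's own utterances, so that, for example, a naturally high-pitched voice is not mistaken for raised pitch. The sketch below illustrates one way to do this. The record layout, the feature name `f0_max` and the use of the population standard deviation are assumptions made for illustration; the paper does not state these details.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_speaker(rows, feature):
    """Return z-scores of `feature`, normalized within each speaker.

    rows: list of dicts with a "speaker" key and a numeric `feature` key
    (a hypothetical layout; the paper does not specify one).
    """
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row["speaker"]].append(row[feature])

    # Per-speaker mean and population standard deviation (an assumption).
    stats = {s: (mean(v), pstdev(v)) for s, v in by_speaker.items()}

    scores = []
    for row in rows:
        mu, sigma = stats[row["speaker"]]
        # A speaker with constant values gets 0.0 instead of dividing by zero.
        scores.append((row[feature] - mu) / sigma if sigma > 0 else 0.0)
    return scores


# Invented values, for illustration only.
utterances = [
    {"speaker": "s1", "f0_max": 200.0},
    {"speaker": "s1", "f0_max": 220.0},
    {"speaker": "s1", "f0_max": 240.0},
    {"speaker": "s2", "f0_max": 110.0},
    {"speaker": "s2", "f0_max": 130.0},
]
z = zscore_by_speaker(utterances, "f0_max")
```

In this toy data the two speakers have very different raw pitch ranges, yet after normalization each speaker's values are centred on zero.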

TABLE II

EXTRACTED AND SELECTED FEATURES.

NO. OF FEATURES    FEATURES
6                  Minimum, Maximum and Standard deviation, relative position min f0,
                   relative position max f0: statistics of fundamental frequency (f0), Pitch
4                  Minimum, Maximum, Mean and Standard Deviation (RMS): statistics of Intensity
1                  Ratio of voiced frames to total frames in the speech signal, as an
                   approximation of speaking rate
2                  Total silence, Percent silence.
2                  Speaking duration, Total duration.
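
To make the pitch and voicing rows of Table II concrete, the following sketch computes several of them from a frame-level f0 contour. It is an illustration only, not the PRAAT or Wavesurfer procedure used in the study. The function name, the convention that 0.0 marks an unvoiced frame, the definition of relative position as frame index divided by the last frame index, and the use of the population standard deviation are all assumptions.

```python
from statistics import pstdev


def f0_features(f0_contour):
    """Summarize a frame-level f0 contour given in Hz.

    Assumed convention: a value of 0.0 marks an unvoiced frame.
    """
    n_frames = len(f0_contour)
    voiced = [(i, f) for i, f in enumerate(f0_contour) if f > 0]
    if not voiced:
        raise ValueError("contour has no voiced frames")

    values = [f for _, f in voiced]
    i_min = min(voiced, key=lambda p: p[1])[0]  # frame index of the lowest f0
    i_max = max(voiced, key=lambda p: p[1])[0]  # frame index of the highest f0
    last = n_frames - 1

    return {
        "f0_min": min(values),
        "f0_max": max(values),
        "f0_std": pstdev(values),
        # Relative position: 0.0 = first frame, 1.0 = last frame.
        "f0_min_rel_pos": i_min / last if last > 0 else 0.0,
        "f0_max_rel_pos": i_max / last if last > 0 else 0.0,
        # Ratio of voiced frames to total frames (speaking-rate proxy).
        "voiced_ratio": len(voiced) / n_frames,
    }


# Invented five-frame contour, for illustration only.
contour = [0.0, 180.0, 200.0, 0.0, 220.0]
features = f0_features(contour)
```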



IV. RESULTS 

The extracted features are used as input variables to the RAPIDMINER machine learning software, which
builds C4.5 decision tree models, iteratively building weak models and then combining them to form a
better model that predicts the classification of unseen data. As an initial model we trained a single
decision tree using the 15 selected features listed in Table II. The model was evaluated over all the
utterances of the corpus, classifying each as certain, uncertain or neutral; under cross-validation it
achieved an accuracy of 65%, compared to a classification accuracy of 51.1% for the non-prosodic model.
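
C4.5 selects the attribute and threshold for each tree node using the gain ratio, that is, the information gain of a split divided by the entropy of the split itself. The sketch below shows that criterion for a binary threshold split on one continuous feature. It is a simplified illustration and not the RAPIDMINER implementation; the function names and the toy z-scores and labels are invented for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of splitting on `value <= threshold` versus `value > threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that sends everything one way carries no information

    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    info_gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    split_info = -(w_left * log2(w_left) + w_right * log2(w_right))
    return info_gain / split_info


# Invented z-scored feature values and labels, for illustration only.
f0_max_z = [-1.2, -0.7, 0.6, 1.1]
labels = ["certain", "certain", "uncertain", "uncertain"]

# Try each observed value (except the largest) as a candidate threshold
# and keep the (gain ratio, threshold) pair with the highest gain ratio.
best = max((gain_ratio(f0_max_z, labels, t), t) for t in f0_max_z[:-1])
```

A full tree learner would repeat this search over all 15 features at every node and recurse on the two resulting subsets.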

Fig. 2. Decision tree obtained after data discretisation.

Fig. 2. A Plot of f0 and Output.

Fig. 2. Undiscretised Data Decision Tree.

Fig. 4. The Performance Statistics

V. CONCLUSION 

In human-computer interaction the computer has to act in a human-like way: besides the lexical
information, it should be able to utilize auditory and visual cues so that it responds to users in a
manner based on their emotions and the system appears more user friendly. In an automated performance
grading system, students could be graded, and a few selected, based on this information, which can also
help the automated SGS to design a course layout based more closely on the preferences and prosodic
information of the user. When the system talks about certain concepts, the prosodic features can indicate
how certain the student is about the terms. Thus prosodic information provides information about the
internal state of mind of the user and would help the dialogue manager to dynamically select its
strategy based on the user's certainty or uncertainty.

In our experiment we used a small set of prosodic features that have been examined in related work by
other researchers. Using an expanded set of features is expected to improve the results and the accuracy
with which certainty can be detected. In future work we will use visual cues such as facial expressions,
body language, emotions and other human inputs to maximize the ability to determine the internal mental
state of the user, which can give the spoken dialogue system a mechanism to select a dynamic dialogue
strategy.